Project option 4 - Forecast Metro train delays in and around NYC: An amazing new dataset has popped up on Kaggle recently that list origin/destinations delays for Amtrak and NJ Transit trains. Can you predict train delays? Consider the time frame that it would be useful to have such predictions. Predicting 5 minutes out is not going to be as useful as 2-3 hours out. Consider training on a month and predicting for the next week or two. Consider time/space (train line, county etc.) cross validation. Many app use cases here.
Although train delays may seem like minor annoyances in the daily life of a commuter, there are real economic costs to train tardiness. The city of New York Office of the Controller found that major delays (5 minute delays) annually cost the city $170.2 million dollars (Stringer, 2017). Although this study was done on subway delays, the same idea can be applied to regional trains as the NJTransit has a history of delays (McGeehan, 2018). The algorithm that we have developed predicts how late trains will be based on a variety of variables, and can be utilized by agencies, researchers, and the average commuter.
For agencies and researchers, the visualizations and algorithm can be a tool to identify where trains are late. Whether that be the stations that are coming from or arriving to or if there is a particular day of the week that is most late. This analysis will provide ways to identify lateness so that it can be addressed. For the average commuter, this analysis can provide a way to determine average lateness and given a set of variables; how late will the train be?
remove <- c("ID","from_id","to_id","train_id","type","interval60","line","date","scheduled_datetime",
"actual_datetime","Actual_time","accumulated_delay", "quintile_delay", "AM_PM", "status")
ColNamesClean <- as.tibble(colnames(df %>%
select(!remove))) %>%
rename(Name = value) %>%
mutate(Name = str_to_title(Name),
Name = str_replace(Name,"_"," "))
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(remove)` instead of `remove` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
datatable(df_shp %>%
st_drop_geometry() %>%
select(!remove) %>%
sample_n(500),
class = 'cell-border stripe',
rownames = FALSE,
colnames = ColNamesClean$Name,
caption = "Table 1: Table view of the data",
filter = 'bottom',
options = list(pageLength = 10))
This model uses data from Kaggle on Amtrak and NJTransit, shapefile data from ESRI, and weather data from riem. For the data from kaggle we used data specifically from NJtransit. The NJtransit is the public transit system for the state of New Jersey and includes buses, trains, and light rail; however, for this study we are only finding delays for trains. The NJtransit has 11 different train lines and 165 stations. The dataset that we have selected comprises of 870025 and 30 variables. From this dataset, we engineered a number of other variables such as station lag, the delay at the previous station, and the accumulated delay at each stop sequence. In addition to the train data, we also included data on the weather, at the time of the train departing. Weather data was pulled from the nearest station which was Newark International Airport.
Table 1 includes a sample of 500 rows to view the data.
To conceptualize what variables to consider in the model, we explored the data through plot correlations, categorical data, delays by time, delays by station, geography and delays over time.
To explore the correlation in data, we used the DataExplorer package’s plot_correlation() function. This package also plots correlation of categorical data by breaking categorical data into dummy variables. The plot below shows the relationship between all the variables.
To better understand the relationships in categorical data, in the plot below the average delay was plotted for AM/PM, line, scheduled time of day, and status of the train. The plot does indicate that There is not a large difference between trains based on AM/PM; however, there are differences in delay based on the line.
Next we wanted to understand the relationship of lateness over the course of 24 hours. First we categorized the scheduled trains by AM Rush, Mid-Day, PM Rush, and Overnight, and then counted the frequencies of how many trains were late during each of these periods. Next the overall line delays in the dataset were counted and plotted against the hour in the day. This shows at what time of day how many trains were late.
A <- df %>%
mutate(delay_grtr5min = case_when(delay_minutes >= 5 ~ 1,
delay_minutes < 5 ~ 0)) %>%
filter(delay_grtr5min == 1) %>%
group_by(interval60, scheduled_TOD, LINE_NAME) %>%
tally() %>%
summarize(avg_trainsLate = mean(n)) %>%
ggplot() +
geom_histogram(aes(x = avg_trainsLate), binwidth = 1) +
facet_wrap(~factor(scheduled_TOD,levels=c("AM Rush", "Mid-Day", "PM Rush", "Overnight"))) +
plotTheme() +
labs(title = "Delays by Time of Day",
x = "Number of Trains Late",
caption = "Figure X")
B <- df %>%
mutate(delay_grtr5min = case_when(delay_minutes >= 5 ~ 1,
delay_minutes < 5 ~ 0)) %>%
filter(delay_grtr5min == 1) %>%
group_by(interval60) %>%
ggplot() +
geom_freqpoly(aes(hour(scheduled_datetime), color = LINE_NAME), size = 1, binwidth = 1) +
plotTheme() +
labs(title = "Count of Lateness by Line by Time",
x = "Scheduled Hour of Day",
caption = "Figure X")
grid.arrange(A,B, ncol = 2, nrow=1, widths = c(1.5,2),
top = textGrob("Plots showing Usage Over Time", gp=gpar(fontsize=20,font=3)))
Next we analyzed day within the time of the year, rather than hours within a day. Figure X shows the amount of delay over the course of a year by day. The dotted line in the figure indicates the first day covid-19 was detected in the United States. Based on the data, it is difficult to determine the role that covid-19 had on average delay times. To better understand the role of covid-19, data from year prior to covid will need to be used in a t-test.
#df %>%
# filter(date == as.numeric(ymd(c("2020-01-20"))))
covid_day <- 691 # for full dataset
covid_day <- 20
animate(
df_shp %>%
st_drop_geometry() %>%
group_by(day,LINE_NAME) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
na.omit() %>%
ggplot(aes(x = day, y = mean_delay, color = LINE_NAME)) +
geom_line(se = FALSE) +
geom_vline(xintercept = covid_day, linetype = "dashed") +
labs(title = "Average Delay Over Time",
subtitle = "By Line",
y = "Average Delay",
x = "Day") +
plotTheme() +
theme(legend.position = "hide") +
facet_wrap(~LINE_NAME)+
transition_reveal(day),
width = 800, height = 800)
#anim_save(filename = "LINE_NAME.gif",animation = last_animation())
In addition to the animated plots, Figure X also includes average delays that has been smoothed using geom_smooth rather than geom_line. geom_smooth estimates a best fit line rather than directly plotting and following every single point.
ggplotly(df_shp %>%
st_drop_geometry() %>%
group_by(date,LINE_NAME) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
ggplot(aes(x = date, y = mean_delay, color = LINE_NAME)) +
geom_smooth(se = FALSE) +
geom_vline(xintercept = as.numeric(ymd(c("2020-01-20"))), linetype = "dashed") +
labs(title = "Average Delay by Day of Year",
subtitle = "By Line",
y = "Average Delay",
x = "Date",
caption = "Figure X") +
plotTheme())
#ggsave("smooth_line_delay.jpg", width = 16, height = 9, unit = "in", dpi = 300)
Finally, figure X, includes an estimated line for delays by stop sequence. Stop sequence indicates where on the line this train has stopped; for example, the North East Corridor may have multiple stops and those stops are organized by values 1, 2, 3, and so forth.
factor_order <- c(1:26)
animate(df_shp %>%
st_drop_geometry() %>%
group_by(day,stop_sequence) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(stop_sequence = factor(stop_sequence, levels = factor_order)) %>%
ggplot(aes(x = day, y = mean_delay, color = stop_sequence)) +
geom_line(se = FALSE) +
geom_vline(xintercept = covid_day, linetype = "dashed") +
labs(title = "Average Delay by Day of Year",
subtitle = "By Line",
y = "Average Delay",
x = "Day") +
plotTheme() +
facet_wrap(~stop_sequence) +
theme(legend.position = "hide",
axis.ticks.x = element_blank()) +
transition_reveal(day),
width = 800, height = 800)
#anim_save(filename = "stop_sequence.gif",animation = last_animation())
Similarly to the analysis by Line Name, figure x below is a smoothed out using geom_smooth rather than geom_line
ggplotly(df_shp %>%
st_drop_geometry() %>%
group_by(date,stop_sequence) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(stop_sequence = factor(stop_sequence, levels = factor_order)) %>%
ggplot(aes(x = date, y = mean_delay, color = stop_sequence)) +
geom_smooth(se = FALSE) +
geom_vline(xintercept = as.numeric(ymd(c("2020-01-20"))), linetype = "dashed") +
labs(title = "Average Delay Over Time",
subtitle = "By Stop Sequence",
y = "Average Delay",
x = "Date",
caption = "Figure X") +
plotTheme())
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#ggsave("smooth_Stop_delay.jpg", width = 16, height = 9, unit = "in", dpi = 300)
Rather then analyze delays by time of day, we also wanted to understand the role that stations had in delay times for trains. Below we included a plot that shows the average delay to and from a station.
ggplotly(ggplot(data = cat %>%
filter(Variable %in% c("from","to")) %>%
arrange(Value) %>%
mutate(Value = as.factor(Value)),
aes(x = Value, y = delay_minutes, fill = Variable)) +
geom_bar(position = "dodge2", stat = "summary", fun.y = "mean") +
scale_y_continuous(labels = comma) +
plotTheme() +
labs(x = "Station Names",
y = "Delay in Minutes",
title = "Cateogrical Variables Average Delay",
caption = "Figure X") +
theme(axis.title.x = element_blank()) +
coord_flip())
#ggsave("Station_Delay.jpg", width = 16*1.5, height = 9*1.5, unit = "in", dpi = 300)
In addition to analysis on Time, this study also includes space. The maps below of the NJ transit lines show the average delow going to and from each station on the NJtransit
A <- df_shp %>%
st_drop_geometry() %>%
group_by(from) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10))
A <- length(unique(A$Bin_mean_delay))
cc <- scales::seq_gradient_pal("Red","Grey")(seq(0,1,length.out=A))
#df_shp %>%
# st_drop_geometry() %>%
# group_by(from) %>%
# summarize(mean_delay = mean(delay_minutes)) %>%
# mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10),
# Bin_mean_delay = str_remove(Bin_mean_delay,"."),
# Bin_mean_delay = str_sub(Bin_mean_delay,1,nchar(Bin_mean_delay)-1)) %>%
# separate(Bin_mean_delay, into = c("Start","End"), sep = ",") %>%
# mutate(Start = round(as.numeric(Start),3),
# End = round(as.numeric(End),3)) %>%
# arrange(Start) %>%
# mutate(Start = as.character(Start),
# End = as.character(End),
# Interval = paste0(Start," - ",End),
# Interval = forcats::as_factor(Interval)) %>%
# rename(STATION_ID = from) %>%
# left_join(NJtransitstations, by = "STATION_ID") %>%
# st_as_sf() %>%
# ggplot() +
# geom_sf(data = NJTransitLines) +
# geom_sf(aes(color = Interval)) +
# geom_sf(data = NJshp, fill = "transparent") +
# scale_color_manual(values = cc) +
# labs(title = "Delay in minutes going from a Station") +
# mapTheme()
B <- df_shp %>%
st_drop_geometry() %>%
group_by(from) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
rename(STATION_ID = from) %>%
left_join(NJtransitstations, by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = mean_delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "From a Station") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
#ggsave("Delay_from_station.jpg", width = 16, height = 9, unit = "in", dpi = 300)
A <- df_shp %>%
st_drop_geometry() %>%
group_by(to) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10))
A <- length(unique(A$Bin_mean_delay))
cc <- scales::seq_gradient_pal("Red","Grey")(seq(0,1,length.out=A))
#df_shp %>%
# st_drop_geometry() %>%
# group_by(to) %>%
# summarize(mean_delay = mean(delay_minutes)) %>%
# mutate(Bin_mean_delay = cut_interval(mean_delay, n = 20, dig.lab = 10),
# Bin_mean_delay = str_remove(Bin_mean_delay,"."),
# Bin_mean_delay = str_sub(Bin_mean_delay,1,nchar(Bin_mean_delay)-1)) %>%
# separate(Bin_mean_delay, into = c("Start","End"), sep = ",") %>%
# mutate(Start = as.numeric(Start)) %>%
# arrange(Start) %>%
# mutate(Start = as.character(Start),
# Interval = paste0(Start," - ",End),
# Interval = forcats::as_factor(Interval)) %>%
# rename(STATION_ID = to) %>%
# left_join(NJtransitstations, by = "STATION_ID") %>%
# st_as_sf() %>%
# ggplot() +
# geom_sf(data = NJTransitLines) +
# geom_sf(aes(color = Interval)) +
# geom_sf(data = NJshp, fill = "transparent") +
# scale_color_manual(values = cc) +
# labs(title = "Delay in minutes going to a Station") +
# mapTheme()
#ggsave("Delay_to_station.jpg", width = 16*.5, height = 9*.5, unit = "in", dpi = 300)
C <- df_shp %>%
st_drop_geometry() %>%
group_by(to) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
rename(STATION_ID = to) %>%
left_join(NJtransitstations, by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = mean_delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "To a Station") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
#ggsave("Delay_to_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
grid.arrange(C,B, ncol = 2, nrow=1,
top = textGrob("Average Delay To and From Station", gp=gpar(fontsize=20,font=3)))
set.seed(111)
#sample_size = round(nrow(df_shp %>%
# st_drop_geometry())*.6)
#trainIndex <- sample(seq_len(nrow(df_shp %>%
# st_drop_geometry())), size = sample_size)
#Train <- lmdf[trainIndex,]
#Valid <- lmdf[-trainIndex,]
remove <- c("ID","from_id","to_id","train_id","type","interval60","line","date","scheduled_datetime",
"actual_datetime","Actual_time","accumulated_delay", "quintile_delay", "AM_PM", "status")
Sample <- df %>%
select(!remove) %>%
# Filter out Princeton Dinky because it didn't run during parts of pandemic
filter(LINE_NAME != "Princeton Dinky")
Train <- filter(Sample, (week >= 1) & (week <= 15))
Test <- filter(Sample, (week >= 16))
# Test <- df %>%
# select(!remove) %>%
# filter(!ID %in% Sample$ID) %>%
# filter(from %in% Sample$from) %>%
# filter(to %in% Sample$to) %>%
# filter(LINE_NAME %in% Sample$LINE_NAME) %>%
# na.omit()
#to test use slice()
#
# Train <- Train %>%
# select(!ID)
#
# Test <- Test %>%
# select(!ID)
# Garbage Collector!!
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 4438354 237.1 7968126 425.6 7968126 425.6
## Vcells 130261438 993.9 234488667 1789.1 234487594 1789.0
After exploring the data, we built a linear model that includes stop_sequence, from, to, delay_minutes, week, DotW, Start_time, Year, station_lag, scheduled_TOD, day, Temperature, Precipitation, Wind_Speed, LINE_NAME. This model in comparison to a model that includes all the variables is more advantageous because it avoids the problem of being overfit. When a model was built that included all the variables, the model had an R^2 value of 1 and 0 Mean Squared Error (MSE). The current model includes an MSE not equal to 0 and a R^2 value less than 1, which means the model is not overfit, and can predict on new data.
#reg.training <- fastLm(delay_minutes ~ ., data = Train)
reg.training <- lm(delay_minutes ~ ., data = Train)
#summary(reg.training) Our R squared is 1. Definitely an overfit model. We'll see how well it predicts though.
f <- predict(reg.training, Test)
Test <- bind_cols(Test, as.tibble(f)) %>%
rename(Delay.Predict = value) %>%
mutate(Delay.Error = Delay.Predict - delay_minutes,
Delay.AbsError = abs(Delay.Predict - delay_minutes),
Delay.APE = (abs(Delay.Predict - delay_minutes)) / Delay.Predict)
mae <- mean(Test$Delay.AbsError, na.rm=TRUE)
#print(mae)
mape <- mean(Test$Delay.APE, na.rm=TRUE)
#print(mape)
# Table of regression outputs
#modelsummary(reg.training, output = "kableExtra", stars = TRUE, gof_omit = 'IC|Log|AIC|BIC')
stargazer(reg.training, type = 'html')
| Dependent variable: | |
| delay_minutes | |
| stop_sequence | -0.023*** |
| (0.001) | |
| fromAbsecon | 3.077*** |
| (0.214) | |
| fromAllendale | 1.224*** |
| (0.128) | |
| fromAllenhurst | 0.245 |
| (0.177) | |
| fromAnderson Street | 2.917*** |
| (0.147) | |
| fromAnnandale | 1.813*** |
| (0.267) | |
| fromAsbury Park | -0.208 |
| (0.146) | |
| fromAtco | 3.199*** |
| (0.897) | |
| fromAtlantic City Rail Terminal | 4.444*** |
| (0.905) | |
| fromAvenel | 1.014*** |
| (0.098) | |
| fromBasking Ridge | 0.596*** |
| (0.190) | |
| fromBay Head | -0.309 |
| (0.211) | |
| fromBay Street | 1.511*** |
| (0.117) | |
| fromBelmar | 0.108 |
| (0.170) | |
| fromBerkeley Heights | 2.249*** |
| (0.146) | |
| fromBernardsville | -2.223*** |
| (0.200) | |
| fromBloomfield | 0.655*** |
| (0.110) | |
| fromBoonton | 2.519*** |
| (0.177) | |
| fromBound Brook | -0.507*** |
| (0.150) | |
| fromBradley Beach | 0.092 |
| (0.197) | |
| fromBrick Church | 0.373*** |
| (0.100) | |
| fromBridgewater | -0.749*** |
| (0.148) | |
| fromBroadway Fair Lawn | 0.542*** |
| (0.133) | |
| fromCampbell Hall | 1.086*** |
| (0.191) | |
| fromChatham | 0.347*** |
| (0.113) | |
| fromCherry Hill | 3.102*** |
| (0.899) | |
| fromClifton | 2.108*** |
| (0.136) | |
| fromConvent Station | 0.231* |
| (0.120) | |
| fromCranford | -0.912*** |
| (0.113) | |
| fromDelawanna | 1.605*** |
| (0.130) | |
| fromDenville | 1.168*** |
| (0.128) | |
| fromDover | 2.008*** |
| (0.135) | |
| fromDunellen | -0.785*** |
| (0.136) | |
| fromEast Orange | -0.210** |
| (0.100) | |
| fromEdison | 1.169*** |
| (0.097) | |
| fromEgg Harbor City | 4.769*** |
| (0.900) | |
| fromElberon | 0.370*** |
| (0.118) | |
| fromElizabeth | 0.754*** |
| (0.089) | |
| fromEmerson | 2.603*** |
| (0.162) | |
| fromEssex Street | 1.518*** |
| (0.135) | |
| fromFanwood | -0.797*** |
| (0.125) | |
| fromFar Hills | 2.807*** |
| (0.207) | |
| fromGarfield | 1.274*** |
| (0.130) | |
| fromGarwood | -0.485*** |
| (0.123) | |
| fromGillette | 0.170 |
| (0.154) | |
| fromGladstone | 2.906*** |
| (0.225) | |
| fromGlen Ridge | 1.119*** |
| (0.115) | |
| fromGlen Rock Boro Hall | -0.071 |
| (0.132) | |
| fromGlen Rock Main Line | -0.174 |
| (0.137) | |
| fromHackettstown | 3.954*** |
| (0.257) | |
| fromHamilton | 3.270*** |
| (0.107) | |
| fromHammonton | 1.023*** |
| (0.193) | |
| fromHarriman | 2.606*** |
| (0.166) | |
| fromHawthorne | 1.352*** |
| (0.134) | |
| fromHazlet | 1.093*** |
| (0.124) | |
| fromHigh Bridge | 1.221*** |
| (0.292) | |
| fromHighland Avenue | 0.621*** |
| (0.106) | |
| fromHillsdale | 2.083*** |
| (0.159) | |
| fromHoboken | 0.865*** |
| (0.102) | |
| fromJersey Avenue | 1.934*** |
| (0.106) | |
| fromKingsland | 0.761*** |
| (0.116) | |
| fromLake Hopatcong | 0.918*** |
| (0.179) | |
| fromLebanon | 0.690*** |
| (0.249) | |
| fromLincoln Park | 2.769*** |
| (0.192) | |
| fromLinden | 1.801*** |
| (0.091) | |
| fromLindenwold | 2.096*** |
| (0.170) | |
| fromLittle Falls | 0.909*** |
| (0.172) | |
| fromLittle Silver | -0.184** |
| (0.090) | |
| fromLong Branch | 2.954*** |
| (0.154) | |
| fromLyndhurst | 0.768*** |
| (0.139) | |
| fromLyons | -0.599*** |
| (0.184) | |
| fromMadison | 0.502*** |
| (0.117) | |
| fromMahwah | 0.619*** |
| (0.131) | |
| fromManasquan | -1.364*** |
| (0.192) | |
| fromMaplewood | 0.052 |
| (0.106) | |
| fromMetropark | 1.353*** |
| (0.092) | |
| fromMetuchen | 0.986*** |
| (0.095) | |
| fromMiddletown NJ | -0.226*** |
| (0.063) | |
| fromMiddletown NY | 1.709*** |
| (0.205) | |
| fromMillburn | 0.117 |
| (0.104) | |
| fromMillington | 1.903*** |
| (0.173) | |
| fromMontclair Heights | -0.017 |
| (0.138) | |
| fromMontclair State U | 1.639*** |
| (0.139) | |
| fromMontvale | 2.121*** |
| (0.166) | |
| fromMorris Plains | 0.200 |
| (0.124) | |
| fromMorristown | 0.958*** |
| (0.122) | |
| fromMount Arlington | 1.446*** |
| (0.157) | |
| fromMount Olive | 1.602*** |
| (0.229) | |
| fromMount Tabor | 0.652*** |
| (0.129) | |
| fromMountain Avenue | 0.407*** |
| (0.135) | |
| fromMountain Lakes | 0.577*** |
| (0.154) | |
| fromMountain Station | 0.349*** |
| (0.107) | |
| fromMountain View | 0.302 |
| (0.193) | |
| fromMurray Hill | 0.003 |
| (0.133) | |
| fromNanuet | 2.670*** |
| (0.180) | |
| fromNetcong | 2.476*** |
| (0.206) | |
| fromNetherwood | -1.329*** |
| (0.129) | |
| fromNew Bridge Landing | 1.454*** |
| (0.145) | |
| fromNew Brunswick | 1.286*** |
| (0.098) | |
| fromNew Providence | 1.817*** |
| (0.123) | |
| fromNew York Penn Station | 1.474*** |
| (0.093) | |
| fromNewark Airport | 0.896*** |
| (0.089) | |
| fromNewark Broad Street | 0.418*** |
| (0.096) | |
| fromNewark Penn Station | 1.509*** |
| (0.088) | |
| fromNorth Branch | 3.107*** |
| (0.200) | |
| fromNorth Elizabeth | 1.291*** |
| (0.093) | |
| fromOradell | 1.899*** |
| (0.156) | |
| fromOrange | 0.151 |
| (0.102) | |
| fromOtisville | -0.233 |
| (0.214) | |
| fromPark Ridge | 2.693*** |
| (0.158) | |
| fromPassaic | -0.244* |
| (0.145) | |
| fromPaterson | 0.044 |
| (0.144) | |
| fromPeapack | -2.785*** |
| (0.217) | |
| fromPearl River | 1.704*** |
| (0.161) | |
| fromPennsauken | 1.835*** |
| (0.142) | |
| fromPerth Amboy | 0.747*** |
| (0.061) | |
| fromPhiladelphia | 3.132*** |
| (0.904) | |
| fromPlainfield | -0.262* |
| (0.139) | |
| fromPlauderville | 1.207*** |
| (0.130) | |
| fromPoint Pleasant Beach | 2.819*** |
| (0.233) | |
| fromPort Jervis | 3.405*** |
| (0.228) | |
| fromPrinceton Junction | 0.763*** |
| (0.101) | |
| fromRadburn Fair Lawn | 1.926*** |
| (0.128) | |
| fromRahway | 1.828*** |
| (0.088) | |
| fromRamsey Main St | 1.149*** |
| (0.131) | |
| fromRamsey Route 17 | 0.374*** |
| (0.125) | |
| fromRaritan | -0.775*** |
| (0.163) | |
| fromRed Bank | 1.956*** |
| (0.140) | |
| fromRidgewood | 1.467*** |
| (0.125) | |
| fromRiver Edge | 2.857*** |
| (0.157) | |
| fromRoselle Park | -0.305*** |
| (0.106) | |
| fromRutherford | 1.901*** |
| (0.116) | |
| fromSalisbury Mills-Cornwall | 0.054 |
| (0.181) | |
| fromSecaucus Lower Lvl | 2.194*** |
| (0.125) | |
| fromSecaucus Upper Lvl | 0.166* |
| (0.091) | |
| fromShort Hills | 0.387*** |
| (0.109) | |
| fromSloatsburg | 1.725*** |
| (0.145) | |
| fromSomerville | -0.158 |
| (0.158) | |
| fromSouth Amboy | 1.445*** |
| (0.107) | |
| fromSouth Orange | 0.403*** |
| (0.103) | |
| fromSpring Lake | 1.923*** |
| (0.215) | |
| fromSpring Valley | 2.454*** |
| (0.182) | |
| fromStirling | -0.693*** |
| (0.166) | |
| fromSuffern | 1.150*** |
| (0.132) | |
| fromSummit | 1.047*** |
| (0.107) | |
| fromTeterboro | 2.298*** |
| (0.140) | |
| fromTowaco | 0.030 |
| (0.188) | |
| fromTrenton | -0.869*** |
| (0.112) | |
| fromTuxedo | 0.281* |
| (0.159) | |
| fromUnion | 0.033 |
| (0.108) | |
| fromUpper Montclair | 0.520*** |
| (0.129) | |
| fromWaldwick | 0.036 |
| (0.131) | |
| fromWalnut Street | 0.740*** |
| (0.123) | |
| fromWatchung Avenue | 0.919*** |
| (0.126) | |
| fromWatsessing Avenue | 1.538*** |
| (0.105) | |
| fromWayne-Route 23 | 2.048*** |
| (0.182) | |
| fromWesmont | 0.996*** |
| (0.132) | |
| fromWestfield | -1.042*** |
| (0.117) | |
| fromWestwood | 2.401*** |
| (0.158) | |
| fromWhite House | -1.472*** |
| (0.220) | |
| fromWood Ridge | 1.961*** |
| (0.123) | |
| fromWoodbridge | 0.476*** |
| (0.091) | |
| fromWoodcliff Lake | 2.269*** |
| (0.163) | |
| toAbsecon | -2.532*** |
| (0.894) | |
| toAllendale | 1.020*** |
| (0.131) | |
| toAllenhurst | 1.351*** |
| (0.176) | |
| toAnderson Street | 0.672*** |
| (0.144) | |
| toAnnandale | 0.803*** |
| (0.269) | |
| toAsbury Park | 0.669*** |
| (0.146) | |
| toAtco | -1.073*** |
| (0.131) | |
| toAtlantic City Rail Terminal | -3.178*** |
| (0.184) | |
| toAvenel | -0.381*** |
| (0.108) | |
| toBasking Ridge | 3.771*** |
| (0.194) | |
| toBay Head | -3.405*** |
| (0.211) | |
| toBay Street | 0.113 |
| (0.121) | |
| toBelmar | 1.292*** |
| (0.170) | |
| toBerkeley Heights | 0.496*** |
| (0.149) | |
| toBernardsville | -1.539*** |
| (0.202) | |
| toBloomfield | -0.616*** |
| (0.114) | |
| toBoonton | 0.646*** |
| (0.180) | |
| toBound Brook | 1.809*** |
| (0.146) | |
| toBradley Beach | 1.095*** |
| (0.196) | |
| toBrick Church | 0.829*** |
| (0.105) | |
| toBridgewater | 1.749*** |
| (0.157) | |
| toBroadway Fair Lawn | -0.461*** |
| (0.135) | |
| toCampbell Hall | 0.477** |
| (0.193) | |
| toChatham | 0.400*** |
| (0.117) | |
| toCherry Hill | -0.282*** |
| (0.093) | |
| toClifton | 2.405*** |
| (0.139) | |
| toConvent Station | 0.199 |
| (0.124) | |
| toCranford | 1.500*** |
| (0.119) | |
| toDelawanna | 1.823*** |
| (0.134) | |
| toDenville | 0.264** |
| (0.132) | |
| toDover | -0.184 |
| (0.138) | |
| toDunellen | 1.351*** |
| (0.148) | |
| toEast Orange | 0.514*** |
| (0.105) | |
| toEdison | -0.151 |
| (0.104) | |
| toEgg Harbor City | 0.183 |
| (0.160) | |
| toElberon | 1.503*** |
| (0.118) | |
| toElizabeth | -0.179* |
| (0.095) | |
| toEmerson | 0.365** |
| (0.162) | |
| toEssex Street | 0.422*** |
| (0.142) | |
| toFanwood | 1.809*** |
| (0.130) | |
| toFar Hills | 3.949*** |
| (0.210) | |
| toGarfield | 0.974*** |
| (0.134) | |
| toGarwood | 1.368*** |
| (0.125) | |
| toGillette | 1.231*** |
| (0.159) | |
| toGladstone | 2.616*** |
| (0.227) | |
| toGlen Ridge | 0.115 |
| (0.119) | |
| toGlen Rock Boro Hall | 0.684*** |
| (0.135) | |
| toGlen Rock Main Line | 0.805*** |
| (0.139) | |
| toHackettstown | -2.431*** |
| (0.260) | |
| toHamilton | 0.653*** |
| (0.110) | |
| toHammonton | -2.700*** |
| (0.889) | |
| toHarriman | 2.335*** |
| (0.169) | |
| toHawthorne | 1.745*** |
| (0.137) | |
| toHazlet | 1.073*** |
| (0.124) | |
| toHigh Bridge | -3.243*** |
| (0.286) | |
| toHighland Avenue | 0.947*** |
| (0.111) | |
| toHillsdale | -0.110 |
| (0.161) | |
| toHoboken | -1.953*** |
| (0.106) | |
| toJersey Avenue | -1.313*** |
| (0.114) | |
| toKingsland | 1.092*** |
| (0.121) | |
| toLake Hopatcong | -1.600*** |
| (0.182) | |
| toLebanon | 0.164 |
| (0.243) | |
| toLincoln Park | 0.692*** |
| (0.196) | |
| toLinden | 0.714*** |
| (0.097) | |
| toLindenwold | -0.843 |
| (0.887) | |
| toLittle Falls | -0.585*** |
| (0.164) | |
| toLittle Silver | -0.174* |
| (0.090) | |
| toLong Branch | -1.356*** |
| (0.154) | |
| toLyndhurst | 0.709*** |
| (0.141) | |
| toLyons | -0.954*** |
| (0.186) | |
| toMadison | 0.680*** |
| (0.122) | |
| toMahwah | 1.933*** |
| (0.133) | |
| toManasquan | -1.653*** |
| (0.191) | |
| toMaplewood | 0.673*** |
| (0.109) | |
| toMetropark | 0.511*** |
| (0.098) | |
| toMetuchen | -0.113 |
| (0.099) | |
| toMiddletown NJ | -0.435*** |
| (0.063) | |
| toMiddletown NY | 1.819*** |
| (0.204) | |
| toMillburn | 0.639*** |
| (0.112) | |
| toMillington | 1.678*** |
| (0.177) | |
| toMontclair Heights | 0.704*** |
| (0.139) | |
| toMontclair State U | -0.108 |
| (0.146) | |
| toMontvale | -0.489*** |
| (0.165) | |
| toMorris Plains | -0.003 |
| (0.127) | |
| toMorristown | 0.950*** |
| (0.126) | |
| toMount Arlington | 0.281* |
| (0.163) | |
| toMount Olive | -1.638*** |
| (0.231) | |
| toMount Tabor | -0.178 |
| (0.133) | |
| toMountain Avenue | 0.773*** |
| (0.139) | |
| toMountain Lakes | -1.117*** |
| (0.168) | |
| toMountain Station | 0.560*** |
| (0.111) | |
| toMountain View | -0.584*** |
| (0.192) | |
| toMurray Hill | -1.010*** |
| (0.138) | |
| toNanuet | 1.115*** |
| (0.177) | |
| toNetcong | 0.207 |
| (0.207) | |
| toNetherwood | 1.509*** |
| (0.136) | |
| toNew Bridge Landing | -0.620*** |
| (0.156) | |
| toNew Brunswick | 0.373*** |
| (0.101) | |
| toNew Providence | -0.195 |
| (0.126) | |
| toNew York Penn Station | -0.912*** |
| (0.097) | |
| toNewark Airport | 0.561*** |
| (0.094) | |
| toNewark Broad Street | 1.477*** |
| (0.101) | |
| toNewark Penn Station | 0.240** |
| (0.094) | |
| toNorth Branch | 1.579*** |
| (0.192) | |
| toNorth Elizabeth | 0.401*** |
| (0.098) | |
| toOradell | -1.137*** |
| (0.163) | |
| toOrange | 0.315*** |
| (0.107) | |
| toOtisville | 0.040 |
| (0.216) | |
| toPark Ridge | 0.642*** |
| (0.163) | |
| toPassaic | 0.280* |
| (0.147) | |
| toPaterson | 0.561*** |
| (0.146) | |
| toPeapack | -1.041*** |
| (0.218) | |
| toPearl River | 0.504*** |
| (0.170) | |
| toPennsauken | -1.374 |
| (0.892) | |
| toPerth Amboy | 0.404*** |
| (0.061) | |
| toPhiladelphia | |
| toPlainfield | 1.850*** |
| (0.137) | |
| toPlauderville | 1.215*** |
| (0.134) | |
| toPoint Pleasant Beach | 2.145*** |
| (0.232) | |
| toPort Jervis | 3.954*** |
| (0.227) | |
| toPrinceton Junction | -0.757*** |
| (0.106) | |
| toRadburn Fair Lawn | 1.719*** |
| (0.131) | |
| toRahway | -0.806*** |
| (0.091) | |
| toRamsey Main St | 1.308*** |
| (0.132) | |
| toRamsey Route 17 | 1.538*** |
| (0.127) | |
| toRaritan | -0.494*** |
| (0.167) | |
| toRed Bank | 2.085*** |
| (0.139) | |
| toRidgewood | 2.046*** |
| (0.127) | |
| toRiver Edge | 0.592*** |
| (0.156) | |
| toRoselle Park | 2.147*** |
| (0.110) | |
| toRutherford | 1.160*** |
| (0.120) | |
| toSalisbury Mills-Cornwall | 1.588*** |
| (0.180) | |
| toSecaucus Lower Lvl | 1.262*** |
| (0.128) | |
| toSecaucus Upper Lvl | 1.326*** |
| (0.096) | |
| toShort Hills | 0.452*** |
| (0.111) | |
| toSloatsburg | 0.724*** |
| (0.148) | |
| toSomerville | 2.022*** |
| (0.159) | |
| toSouth Amboy | 1.010*** |
| (0.107) | |
| toSouth Orange | 0.783*** |
| (0.109) | |
| toSpring Lake | 1.788*** |
| (0.215) | |
| toSpring Valley | -1.454*** |
| (0.190) | |
| toStirling | -0.973*** |
| (0.168) | |
| toSuffern | 0.445*** |
| (0.130) | |
| toSummit | -0.123 |
| (0.113) | |
| toTeterboro | 0.297** |
| (0.142) | |
| toTowaco | -1.571*** |
| (0.193) | |
| toTrenton | -3.741*** |
| (0.117) | |
| toTuxedo | 0.376** |
| (0.157) | |
| toUnion | 1.204*** |
| (0.112) | |
| toUpper Montclair | 0.610*** |
| (0.133) | |
| toWaldwick | 0.572*** |
| (0.133) | |
| toWalnut Street | -0.181 |
| (0.126) | |
| toWatchung Avenue | 0.407*** |
| (0.130) | |
| toWatsessing Avenue | 0.326*** |
| (0.110) | |
| toWayne-Route 23 | 0.594*** |
| (0.186) | |
| toWesmont | 0.415*** |
| (0.134) | |
| toWestfield | 1.872*** |
| (0.121) | |
| toWestwood | 0.350** |
| (0.166) | |
| toWhite House | -0.640*** |
| (0.224) | |
| toWood Ridge | 1.024*** |
| (0.126) | |
| toWoodbridge | 0.189* |
| (0.101) | |
| toWoodcliff Lake | 0.409** |
| (0.167) | |
| week | -0.041*** |
| (0.011) | |
| DotW | 0.003 |
| (0.002) | |
| Start_time | 0.00000*** |
| (0.00000) | |
| Year | |
| station_lag | 0.970*** |
| (0.0004) | |
| scheduled_TODMid-Day | -0.065*** |
| (0.009) | |
| scheduled_TODOvernight | -0.064*** |
| (0.009) | |
| scheduled_TODPM Rush | -0.095*** |
| (0.010) | |
| day | 0.004*** |
| (0.002) | |
| Temperature | 0.002*** |
| (0.0003) | |
| Precipitation | -0.010 |
| (0.015) | |
| Wind_Speed | 0.002*** |
| (0.0005) | |
| LINE_NAMEBergen County Line | -1.029*** |
| (0.115) | |
| LINE_NAMEGladstone Branch | 0.174*** |
| (0.060) | |
| LINE_NAMEMain Line | -0.979*** |
| (0.115) | |
| LINE_NAMEMontclair-Boonton Line | 0.011 |
| (0.060) | |
| LINE_NAMEMorristown Line | 0.086 |
| (0.058) | |
| LINE_NAMENJ Coast Line | -0.283*** |
| (0.051) | |
| LINE_NAMENortheast Corridor | -0.366*** |
| (0.051) | |
| LINE_NAMEPascack Valley Line | -1.368*** |
| (0.120) | |
| LINE_NAMERaritan Valley Line | |
| Constant | -0.712*** |
| (0.128) | |
| Observations | 642,197 |
| R2 | 0.929 |
| Adjusted R2 | 0.929 |
| Residual Std. Error | 2.163 (df = 641853) |
| F Statistic | 24,579.300*** (df = 343; 641853) |
| Note: | p<0.1; p<0.05; p<0.01 |
To ensure that our model works well, we cross-validated the data which compares the estimated delay to the actual delay.
fitControl <- trainControl(method = "cv", number = 5)
set.seed(825)
reg.cv <-
train(delay_minutes ~ ., data = Train,
method = "lm", trControl = fitControl, na.action = na.pass)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
reg.cv
## Linear Regression
##
## 698456 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 558766, 558765, 558764, 558765, 558764
## Resampling results:
##
## RMSE Rsquared MAE
## 2.152509 0.9295134 1.090011
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Test %>%
na.omit() %>%
group_by(LINE_NAME) %>%
summarize(Mean_Delay = mean(Delay.Predict)) %>%
left_join(NJTransitLines, by = "LINE_NAME") %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(color = Mean_Delay,
linetype = LINE_NAME)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "Estimated Delay by Line") +
scale_color_gradient(low = "green",
high = "red") +
mapTheme()
ggsave("Modeled_Delay_Line.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
B <- Test %>%
group_by(from) %>%
summarize(Mean_Delay = mean(Delay.Predict)) %>%
rename(STATION_ID = from) %>%
left_join(NJtransitstations %>%
select(STATION_ID,LATITUDE,LONGITUDE,geometry), by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = Mean_Delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "From",
x = "Predicted Mean Delay") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
ggsave("Modeled_Delay_from_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
C <- Test %>%
group_by(to) %>%
summarize(Mean_Delay = mean(Delay.Predict)) %>%
rename(STATION_ID = to) %>%
left_join(NJtransitstations %>%
select(STATION_ID,LATITUDE,LONGITUDE,geometry), by = "STATION_ID") %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = NJTransitLines) +
geom_sf(aes(color = Mean_Delay)) +
geom_sf(data = NJshp, fill = "transparent") +
labs(title = "To",
x = "Predicted Mean Delay") +
scale_color_gradient(low = "green",
high = "red") +
theme(legend.position = "bottom") +
mapTheme()
ggsave("Modeled_Delay_to_station.jpg", width = 9*.5, height = 16*.5, unit = "in", dpi = 300)
grid.arrange(C,B, ncol = 2, nrow=1,
top = textGrob("Average of predicted Delay To and From Station", gp=gpar(fontsize=20,font=3)))
f <- df_shp %>%
st_drop_geometry() %>%
group_by(date, LINE_NAME) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
ggplot(aes(x = LINE_NAME, y = mean_delay)) +
geom_boxplot()
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
ggplotly(f)
#Delays by status, did it get cancelled, did it leave, etc.
df_shp %>%
st_drop_geometry() %>%
group_by(date,status) %>%
summarize(mean_delay = mean(delay_minutes)) %>%
ggplot(aes(x = date, y = mean_delay, color = status)) +
geom_smooth(se = FALSE) +
geom_vline(xintercept = as.numeric(ymd(c("2020-01-20"))), linetype = "dashed") +
labs(title = "Average Delay Over Time",
subtitle = "By Line",
y = "Average Delay",
x = "Date",
caption = "Figure X") +
plotTheme()
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'